This data set contains examples of 4,898 white wines of the Portuguese “Vinho Verde” variety, with 11 objective variables that quantify the chemical properties of each wine, and 1 subjective variable that rates the quality of each wine from 0 (very bad) to 10 (very excellent), based on ratings from at least 3 experts. The data set was obtained via Udacity’s Data Analyst Nanodegree Program. For more detailed information, see this text file.
To start to understand the data set, the first thing that I wanted to do was print out the structure of the data. Are there any variables that need to be changed to factors?
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
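As a minimal sketch of this first step, the snippet below writes a few of the rows printed in this report to a temporary CSV and reads them back. The file name `wine_sample.csv` is hypothetical (the real file path is not shown in this report), and only three of the columns are included. It also shows one plausible origin of the `X` column: a row index saved as an unnamed first column.

```r
# Six rows copied from the head() output shown in this report (subset of columns).
wine_head <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  alcohol       = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality       = c(6L, 6L, 6L, 6L, 6L, 6L)
)

# Hypothetical file name, written to a temporary directory.
csv_path <- file.path(tempdir(), "wine_sample.csv")
write.csv(wine_head, csv_path)   # row names become an unnamed first column

wine <- read.csv(csv_path)       # read.csv() names that unnamed column "X"
str(wine)                        # quality is read as int, the rest as num
```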
The only variable that really stood out to me as needing to be a factor rather than an integer is quality, as this is a categorical (ordinal) variable. So let’s turn it into a factor, then re-check the structure.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
So there are 7 levels to quality, let’s see all of them:
## [1] "3" "4" "5" "6" "7" "8" "9"
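The conversion can be sketched as below. Since the raw file isn't included here, the quality column is rebuilt from the rating counts reported in this analysis; the `ordered = TRUE` variant in the comment is just an option to consider, not something the analysis above used.

```r
# Rebuild the quality column from the rating counts reported in this analysis.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
wine <- data.frame(quality = rep(3:9, times = quality_counts))

# Convert the integer ratings to a factor.
# factor(wine$quality, ordered = TRUE) would also record that 3 < 4 < ... < 9.
wine$quality <- factor(wine$quality)

str(wine)
levels(wine$quality)
```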
This tells us that there are only 7 levels of quality in the dataset, from 3 through 9. According to the dataset overview, the experts could choose any rating from 0 to 10, so no wines were rated 0, 1, 2 or 10 despite those being options. Let’s see how the scores are distributed with a bar graph:
So it looks like a fairly normal distribution with a peak around 6. Let’s see the actual numbers:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
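A small sketch of how the counts and the bar graph can be produced, using base R graphics as a stand-in (the plotting code used for this report isn't shown). The ratings are rebuilt from the counts above.

```r
# Ratings rebuilt from the counts in the table above.
quality <- factor(rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5)))

counts <- table(quality)          # number of wines per rating
print(counts)
print(round(prop.table(counts), 3))  # share of wines per rating

# Bar graph of the rating counts.
barplot(counts, xlab = "Quality rating", ylab = "Number of wines")
```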
Looks good. Now let’s move on to the objective variables in the data set, starting out with displaying the summary:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality
## Min. : 8.00 3: 20
## 1st Qu.: 9.50 4: 163
## Median :10.40 5:1457
## Mean :10.51 6:2198
## 3rd Qu.:11.40 7: 880
## Max. :14.20 8: 175
## 9: 5
And we’ll take a quick peek at what the data actually looks like:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
Now, let’s plot these variables, starting with fixed acidity, measured as tartaric acid in g/dm^3: Using a binwidth with the specificity of the measurement (0.01):
This variable seems to be fairly normally distributed; however, it is long-tailed in that there seem to be a couple of outliers in the 11+ range. Let’s use a binwidth that isn’t as noisy, but still shows the pattern in the data, and limit the x axis to see only the bulk of the data:
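To illustrate the effect of the binwidth, here is a sketch on synthetic data. The values are simulated, not the real wine measurements: the mean matches the summary above, but the standard deviation of 0.85 and the 0.5 binwidth are assumptions made for the example.

```r
set.seed(42)
# Synthetic stand-in for fixed.acidity (simulated; sd = 0.85 is an assumption).
fixed_acidity <- round(rnorm(1000, mean = 6.855, sd = 0.85), 1)

# Same data, two binwidths: very fine bins are noisy, wider bins are smoother.
fine   <- hist(fixed_acidity, breaks = seq(0, 15, by = 0.01), plot = FALSE)
coarse <- hist(fixed_acidity, breaks = seq(0, 15, by = 0.5),  plot = FALSE)

# Draw the smoother version, zoomed in on the bulk of the data.
plot(coarse, xlim = c(4, 10),
     main = "Fixed acidity (synthetic data)",
     xlab = "Fixed acidity (g/dm^3)")
```

In base graphics, `xlim` only changes the visible range; every observation is still counted in its bin.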
I wonder how many wines there are with a fixed acidity over 11 g/dm^3?
##
## FALSE TRUE
## 4896 2
Ok, so out of 4898 wines, only 2 had a fixed acidity over 11 g/dm^3. I wonder how well they were rated?
## [1] 6 3
## Levels: 3 4 5 6 7 8 9
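The count and the ratings above come from a logical filter, sketched here on the six rows printed earlier. The threshold of 8 is used only so this tiny sample has some matches; the analysis itself uses 11.

```r
# Six rows copied from the head() output; quality keeps the full set of levels.
wine_head <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  quality       = factor(rep(6, 6), levels = 3:9)
)

threshold <- 8  # illustrative only; the analysis uses 11
is_high <- wine_head$fixed.acidity > threshold

table(is_high)               # how many wines are above the threshold
wine_head$quality[is_high]   # how those wines were rated
```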
Interesting, so one had an average rating and the other was rated poorly. Let’s move on to the next variable: volatile acidity, measured as acetic acid content in g/dm^3:
The plot for Volatile Acidity is also long-tailed, with many more outliers present, and once again is quite noisy using a binwidth that is as specific as the measurement. Let’s see the bulk of the data by using the limit argument and setting the binwidth properly:
I wonder how correlated fixed acidity and volatile acidity are?
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and volatile.acidity
## t = -1.5886, df = 4896, p-value = 0.1122
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.050671536 0.005312543
## sample estimates:
## cor
## -0.02269729
That’s very interesting: they are not really correlated at all (the 95% confidence interval includes zero). I wonder why that is? Is it because they are measuring different acids? Or are they still related, just not linearly?
From now on, I’ll only show plots using the binwidth that gives the best variance vs bias trade-off, not the original binwidth based on the increments in the scale of the data.
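One way to probe the "not linearly" question is to compare Pearson's correlation with a rank-based one (Spearman), which picks up any monotonic relationship. The sketch below uses only the first ten values printed in the structure output, so its numbers will not match the full-data result; Spearman is my suggestion here, not something the analysis above ran.

```r
# First ten values of each variable, as printed by str() above.
fixed.acidity    <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)
volatile.acidity <- c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22)

# Linear association.
pearson <- cor.test(fixed.acidity, volatile.acidity, method = "pearson")

# Monotonic (rank-based) association; exact = FALSE because the data contain ties.
spearman <- cor.test(fixed.acidity, volatile.acidity,
                     method = "spearman", exact = FALSE)

pearson$estimate
spearman$estimate
```

A non-monotonic pattern (a U-shape, for example) would be missed by both, which is why a scatterplot is still worth a look.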
Next up, citric acid, measured in g/dm^3:
We can also look at this as a frequency polygon:
Once again, the data is long-tailed, with some outliers present:
##
## FALSE TRUE
## 4888 10
These 10 outliers (Citric Acid Level > 0.9 g/dm^3) had ratings of:
## [1] 6 5 6 6 6 6 6 6 6 6
## Levels: 3 4 5 6 7 8 9
Hmm, that is interesting that they were all considered to be in the “normal” range.
Next up: Residual Sugar, measured in g/dm^3:
Ok, this is really long-tailed. Let’s remove the tail:
This data is positively skewed (it has a long right tail), meaning that most white wines tend to have residual sugar levels in the lower range (75% are at or under 9.9 g/dm^3 and the remaining 25% range from 9.9 to 65.8 g/dm^3). Therefore, most of the wines would be considered “dry” instead of “sweet”. According to winefolly.com, the “Vinho Verde” variety of wine is usually categorized under the “Bone Dry” level of the “Wine Sweetness Chart”, which is the driest of the wine categorizations; however, this is dependent on the winemaker’s style, vintage and regional differences.
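A quick way to check the direction of skew is to compare the mean with the median (in the summary above, the mean of 6.391 is above the median of 5.2, which points to a long right tail) or to compute the moment-based skewness. The sketch below uses the first ten residual sugar values printed earlier; the `skewness` helper is written here for illustration and is not from any package.

```r
# First ten residual.sugar values, as printed by str() above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Illustrative helper: third standardized moment (population form).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mean(residual.sugar) > median(residual.sugar)  # TRUE suggests a right tail
skewness(residual.sugar)                       # > 0 means positive (right) skew
```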
##
## FALSE TRUE
## 4880 18
Ok, there are 18 wines that had a Residual Sugar above 20 g/dm^3:
## [1] 6 6 5 5 6 5 6 6 5 6 5 5 5 6 6 6 6 5
## Levels: 3 4 5 6 7 8 9
And they were all rated at a 5 or 6. I wonder, if the experts were aware that this was a test of this particular variety, and were expecting a dry wine, if that means they rated sweeter wines more poorly than they might have if they were instead testing a wine like a Moscato that tends to be sweet?
Next, let’s visualize chlorides, measured as sodium chloride (aka salt) in g/dm^3:
Another really long-tailed plot; that seems to be a trend in our data.
Once again, a fairly normal distribution once the tail is cut off. Let’s look at Free Sulfur Dioxide, measured in mg/dm^3:
Once again, very long-tailed.
Moving on to Total Sulfur Dioxide, also measured in mg/dm^3:
I wonder how correlated free sulfur dioxide is to total sulfur dioxide?
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
Interesting. I thought they would be highly correlated, but 0.62 is only a “moderate” correlation. I wonder if we can still combine them, though. We’ll make a new variable (prop.sulfur.dioxide) that is the ratio of free sulfur dioxide to total sulfur dioxide, and plot it:
That makes a nice, roughly normal distribution, and reduces the long tail. Let’s see its summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02362 0.19093 0.25368 0.25558 0.31579 0.71053
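The new variable is a simple element-wise ratio. A minimal sketch on the first ten values printed in the structure output (so the summary will differ from the full-data one above):

```r
# First ten values of each variable, as printed by str() above.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Proportion of the total sulfur dioxide that is free.
wine$prop.sulfur.dioxide <- with(wine, free.sulfur.dioxide / total.sulfur.dioxide)

summary(wine$prop.sulfur.dioxide)
```

Because free sulfur dioxide is part of the total, the ratio should stay between 0 and 1, which makes it a handy sanity check on the data.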
Next, let’s plot Density, measured in g/cm^3:
Once again, very long-tailed. According to the data set description, the density of wine is close to that of water (1 g/cm^3), but is dependent on the sugar and alcohol content. Let’s see how density and sugar content are related:
##
## Pearson's product-moment correlation
##
## data: density and residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
And let’s see how density and alcohol content are related:
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Ok, so the residual sugar content and the density are highly correlated in a positive fashion, and the alcohol content and density are highly correlated in a negative fashion. That is definitely interesting to keep in mind as I continue through the bivariate and multivariate components of this investigation.
Next, let’s visualize pH (aka potential of hydrogen):
Image credit: OpenStax College, Anatomy & Physiology, Connexions Web site. http://cnx.org/content/col11496/1.6/, Jun 19, 2013. CC BY 3.0.
So this pH is considered acidic, and falls within the range of other common beverages such as soda, grapefruit juice and tomato juice.
The next variable is Sulphates, measured as potassium sulphate in g/dm^3:
According to the data description, Sulphates contribute to the Free and Total Sulfur Dioxide contents:
##
## Pearson's product-moment correlation
##
## data: sulphates and free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03126264 0.08707928
## sample estimates:
## cor
## 0.05921725
##
## Pearson's product-moment correlation
##
## data: sulphates and total.sulfur.dioxide
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1069590 0.1619585
## sample estimates:
## cor
## 0.1345624
##
## Pearson's product-moment correlation
##
## data: sulphates and prop.sulfur.dioxide
## t = -1.5651, df = 4896, p-value = 0.1176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05033679 0.00564813
## sample estimates:
## cor
## -0.02236186
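So that’s surprising: the Sulphate level is only very weakly correlated with Free Sulfur Dioxide (r = 0.06) and Total Sulfur Dioxide (r = 0.13), and not significantly correlated with the Proportion of Free to Total Sulfur Dioxide (r = -0.02, p = 0.12). With almost 4,900 wines, even a tiny correlation produces a small p-value, so the first two results are statistically significant without being practically meaningful. Once again, is this maybe because they’re just not linearly correlated?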
And lastly, Alcohol content, measured as a percentage of total volume:
The histogram for alcohol content is mildly positively skewed, with a peak at ~9.5% alcohol and a range from 8%-14.2%. According to Wikipedia, wine typically has an alcohol content of 9%-16%, most often 12.5%-14.5%. According to the summary statistics above, the interquartile range for our dataset is 9.5%-11.4%, with a mean of 10.5%.
The dataset contains 4898 examples of “Vinho Verde” Portuguese White Wines, and describes 11 objective variables of each wine, including acidity, sweetness and alcohol content (among others). It also contains an “expert rating” of each wine. Most wines were rated as “normal”, and no wines were rated as “very poor” (Rating of 0, 1 or 2) or “Perfect” (Rating of 10).
The outcome feature of wine quality is the main feature of interest in the dataset. Other features of interest are alcohol content and Residual Sugars, as those features are the most discernible by the consumer (alcohol contents are sometimes labelled, and the sweetness level is very distinguishable).
There are many features that would be of interest that are missing from this dataset, namely features that the consumer would be able to use to distinguish one wine from another. Unfortunately, due to legal/privacy issues, the publishers of the dataset were unable to include these distinguishing factors. If this were not a concern, having the name of the wine, producer, vintage, grape types, selling price, etc. would make this data usable to the common consumer, as doing physicochemical composition analysis on the wines at the store is not feasible.
In the process of univariate exploration, I decided to create a new variable that is the ratio of free sulfur dioxide to total sulfur dioxide in the wines. This resulted in a feature that is approximately normally distributed.
There were a few distributions that had long (and very long) tails, for which I decided to set the x limits to show the bulk of the data only. Otherwise, the distributions were explained by the context of the data, so I did not change the form of the underlying data. As this data set is already a “tidy” data set, I would not expect to have to perform data wrangling on it.
Now I’m going to start to look at relationships between variables, building on the relationships I started to discover in the Univariate plots Section. First I’ll start off with a scatterplot matrix, removing X as a column as it is really just a placeholder. Since this visualization is hard to see, I’ll also export it as a pdf (scatterplot_matrix.pdf).
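As a rough sketch of this step, the snippet below drops `X`, prints a correlation matrix, and exports a scatterplot matrix to PDF. It uses base R's `pairs()` as a simpler stand-in (the function used for the matrix in this report isn't shown, and `pairs()` draws scatterplots only). The data are six rows and three columns copied from the head output above, and the PDF goes to a temporary directory.

```r
# Six rows copied from the head() output (subset of columns).
wine_head <- data.frame(
  X              = 1:6,
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# X is just a row index, so leave it out of the comparison.
wine_vars <- subset(wine_head, select = -X)

# Pairwise Pearson correlations as a compact numeric summary.
round(cor(wine_vars), 2)

# Export the scatterplot matrix to a PDF so it can be viewed at full size.
pdf_path <- file.path(tempdir(), "scatterplot_matrix.pdf")
pdf(pdf_path, width = 10, height = 10)
pairs(wine_vars)
invisible(dev.off())
```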
I already explored many of the relationships that can be explained linearly in the Univariate section (e.g. Density with Residual Sugar and Alcohol Content, or Free and Total Sulfur Dioxide Contents). Let’s plot those relationships:
First, Density and Residual Sugar:
There seems to be more variance in Density for wines that have low residual sugar levels. It seems that wines that have close to zero residual sugar have the most variance, and I would think that this could be explained by the fermentation process of these wines. Wines that start with larger amounts of sugar, but are then allowed to ferment until they have no residual sugar left, would presumably have higher alcohol contents and lower density. Wines that start with less sugar, and are also allowed to ferment until no residual sugar is left, would have less alcohol in comparison, and therefore a density closer to that of water. It would be interesting to have the data on initial sugars (before fermentation) to test this hypothesis.
Then Density and Alcohol Content:
Once again, there seems to be more variance in density in wines with lower alcohol contents. I think that this relationship would be interesting to see in a Multivariate plot between density, alcohol content and residual sugars (see next section).
Next let’s look at the relationship between Fixed and Volatile Acidity, to see if there is a non-linear pattern (as they have a low Pearson’s R score):
It doesn’t really look like there is any pattern, linear or otherwise. I don’t think that I’ll investigate this any further.
Next, let’s visualize the relationships between Sulphates, Free Sulfur Dioxide and Total Sulfur Dioxide:
There is definitely a linear relationship between Free and Total Sulfur Dioxide; however, there is a lot of variance along the line, which is why the Pearson’s R coefficient was 0.62, or only moderately correlated.
Hmm, despite the fact that Sulphates are an additive that can contribute to the Sulfur Dioxide present, there seems to be no discernible pattern or relationship.
Moving on to the main feature: Quality Level
The boxplot between quality and alcohol content in the scatterplot matrix is interesting, let’s take a closer look:
Very interesting, so it looks like wines with alcohol contents lower than 10.5% are most likely to be rated lower (from 3 to 5), although there are many outliers that have high alcohol content, but were rated as a five. The mean alcohol content increases with each increase in quality rating after 5. There are only two outliers where wines with low(er) alcohol content were rated highly and two outliers where wines with high(er) alcohol content were rated poorly.
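A boxplot like this one can be sketched with base R's formula interface. The data below are synthetic: the three rating groups and their alcohol levels are made up purely to make the example runnable and are not taken from the wine data.

```r
set.seed(7)
# Synthetic data: three rating groups with made-up alcohol distributions.
wine <- data.frame(
  quality = factor(rep(5:7, each = 50)),
  alcohol = c(rnorm(50, mean = 9.8,  sd = 0.8),
              rnorm(50, mean = 10.5, sd = 1.0),
              rnorm(50, mean = 11.4, sd = 1.0))
)

# One box per quality rating.
boxplot(alcohol ~ quality, data = wine,
        xlab = "Quality rating", ylab = "Alcohol (% by volume)")

# The centre line of each box is the group median; the group means
# can be computed alongside it for comparison.
medians <- tapply(wine$alcohol, wine$quality, median)
means   <- tapply(wine$alcohol, wine$quality, mean)
round(rbind(medians, means), 2)
```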
Let’s see what this looks like with facet-wrapping:
This really just shows us that most of the wines fell within the 5-7 Quality Ranking range. Let’s see if allowing the axis range to be “free” gives us any more insights:
This is interesting, it shows that the only wines that were rated a 9 had alcohol contents of 10%, 12% or 13%. No wines with alcohol contents of 8% were rated higher than a 5. No wines with an alcohol content of 14% were rated as less than a 5 (though none were ranked as a 9 either). It makes sense that from Quality Ranking 4 through 6, the distributions were positively skewed and from 7 through 9, the distributions were negatively skewed.
pH also seems promising from the scatterplot matrix:
There seems to be a slight trend, but I’m not entirely sure. Let’s group pH by quality level and run a summary:
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
It does seem that pH may follow the same trend as alcohol content with regard to quality rating, as the mean pH is lowest at ranking 5 (3.169), then slowly increases with each step away from 5 in either direction. This is such a small difference that it is unlikely to be significant.
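The grouped summary can be produced with `by()`, and a rank-based test such as Kruskal-Wallis is one way to check whether differences between groups are more than noise. The sketch below uses a small, made-up data frame, so its values are illustrative only, and the test is my suggestion rather than something run in this analysis.

```r
# Made-up values for illustration; not taken from the wine data.
wine <- data.frame(
  quality = factor(rep(5:7, each = 3)),
  pH      = c(3.10, 3.16, 3.22,
              3.08, 3.18, 3.28,
              3.10, 3.20, 3.32)
)

# Summary of pH within each quality level.
print(by(wine$pH, wine$quality, summary))

# Group means in a compact form.
group_means <- tapply(wine$pH, wine$quality, mean)
round(group_means, 3)

# Rank-based test of whether pH differs between quality levels.
kw <- kruskal.test(pH ~ quality, data = wine)
kw$p.value
```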
Let’s look at Quality Ranking and Residual Sugars:
There isn’t really a pattern here, which answers my question about whether the experts rated sweeter wines more poorly due to the fact that Vinho Verde wines are supposed to be dry. It seems that the sweetness did not have a large effect on rating.
Let’s also zoom in on density:
So once again, we see a trend when going above rating 5, as mean density decreases with each increase in quality ranking. As we know that Density also decreases with increasing alcohol content, really this boxplot is telling us that quality ranking increases with increasing alcohol content.
Some features such as alcohol content (and to a lesser degree, pH), seem to have a relationship with the output variable, Quality Ranking.
There were definitely some features that are correlated, which can be explained by their real-life relationships, e.g. Density with alcohol content and residual sugar content.
The relationship between Density and Residual Sugar Level was the strongest relationship I found with a Pearson’s R of 0.839. Second strongest was the relationship between alcohol content and Density at -0.78. In terms of the output variable, Quality Ranking, the strongest pattern that I found was with alcohol content.
Let’s plot the relationship between Density, Alcohol Content and Residual Sugars:
It makes sense that the wines with the highest alcohol content and the lowest residual sugar content have the lowest density, and vice versa.
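One simple way to show three variables in one plot is to put two on the axes and encode the third as colour. The base R sketch below uses the six rows printed earlier; the alcohol bands and colours are arbitrary choices for illustration, and the plotting code used for this report isn't shown.

```r
# Six rows copied from the head() output (subset of columns).
wine_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# Bin alcohol into bands (cut points chosen arbitrarily for this example).
alcohol_band <- cut(wine_head$alcohol,
                    breaks = c(8, 9.5, 10.5, 15),
                    include.lowest = TRUE)
band_cols <- c("steelblue", "orange", "firebrick")

# Density vs residual sugar, coloured by alcohol band.
plot(wine_head$residual.sugar, wine_head$density,
     col = band_cols[as.integer(alcohol_band)], pch = 19,
     xlab = "Residual sugar (g/dm^3)", ylab = "Density (g/cm^3)")
legend("topleft", legend = levels(alcohol_band),
       col = band_cols, pch = 19, title = "Alcohol (%)")
```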
Let’s look at quality in relation to Density and Residual Sugar content:
And by alcohol content?
And Alcohol Content vs Residual Sugar content vs Quality Rating:
Broken up into the highest and lowest rated wines:
These plots once again show that there is a relationship between density and alcohol content, and a relationship between alcohol content and quality rating, as the rating tends to increase as alcohol content goes up and density goes down. This is independent of the Residual Sugar content.
Okay, let’s also see if we can discern a relationship between Free Sulfur Dioxide, Total Sulfur Dioxide and Sulphates:
So it still looks like there is a relationship between Free and Total Sulfur Dioxide, and they appear to be largely independent of Sulphate Level.
Although alcohol content is closely related to Residual Sugar content and density of the wine, combining those three variables did not necessarily act as a stronger indicator of quality rating than just alcohol content on its own.
I found that the relationship between Residual Sugars, Alcohol content and Density of the wine was an interesting interaction. I was surprised that there wasn’t a strong relationship between Free Sulfur Dioxide, Total Sulfur Dioxide and Sulphates.
This bar graph shows the distribution of Quality Rankings for the wines sampled. It shows that wines were ranked as a 6 most commonly, and most wines were ranked in the 5-7 range. No wines were ranked as a 0, 1, 2 or 10. In summary, most of the wines in the data set were “Average” according to the experts.
This box plot shows the distribution of alcohol contents of the wines, sorted by each quality ranking level. It shows that wines with the lowest alcohol contents were more likely to be rated as a 5, and wines with the highest alcohol contents were more likely to be rated as higher quality.
This scatter plot shows the relationship between density, residual sugar level and alcohol content. As you can see, the wines with the highest alcohol content and lowest residual sugar content have the lowest density, and vice versa.
In analysing this data, the strongest relationship that I found with quality ranking was alcohol content. In a way this is a good thing, as alcohol content is the only variable in this data set that is readily available at the liquor store when buying wine.
My biggest struggle with this data set is that I feel it is incomplete for creating a model to guide consumer purchases, which is what I feel the best use case for this type of information would be. Some information that would be readily available at time of purchase would be beneficial (e.g. vintage, type of wines, region, etc.).
I found that visualizing the relationship between residual sugars, alcohol content and density was my favorite part of this analysis. It was quite satisfying to see clear patterns and relationships in my plots. It was disappointing to me when patterns and relationships I expected to be there weren’t there. I think that it would be interesting to use machine learning or a neural network to create a prediction of quality ranking based on this data set.